Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits
Authors: Zeyuan Allen-Zhu, Sébastien Bubeck, Yuanzhi Li
Abstract
Regret bounds in online learning compare the player's performance to L*, the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon T. The more refined concept of a first-order regret bound replaces this with a scaling of √L*, which may be much smaller than √T. It is well known that minor variants of standard algorithms satisfy first-order regret bounds in the full-information and multi-armed bandit settings. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire [2017] raised the issue that existing techniques do not seem sufficient to obtain first-order regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space.
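The abstract notes that minor variants of standard algorithms already achieve first-order regret bounds in the multi-armed bandit setting. As a minimal illustration of that baseline setting (not of the paper's new policy-space-augmentation strategy), here is a sketch of the classical EXP3 algorithm with importance-weighted loss estimates; the function names, step size, and toy adversary below are our own illustrative choices.

```python
import math
import random

def exp3(K, T, loss_fn, eta=0.1, rng=None):
    """Minimal EXP3 sketch for adversarial K-armed bandits, losses in [0, 1].

    The classical tuning gives regret O(sqrt(T K log K)); loss-dependent
    (first-order) tunings of eta refine this to O(sqrt(L* K log K)).
    """
    rng = rng or random.Random(0)
    weights = [1.0] * K
    total_loss = 0.0
    for t in range(T):
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]
        loss = loss_fn(t, arm)                    # only the played arm's loss is observed
        total_loss += loss
        # unbiased importance-weighted estimate: loss / probs[arm]
        weights[arm] *= math.exp(-eta * loss / probs[arm])
        m = max(weights)                          # renormalize to avoid underflow
        weights = [w / m for w in weights]
    return total_loss

# toy adversary: arm 0 always incurs loss 0.2, the others 0.5
losses = exp3(K=3, T=2000, loss_fn=lambda t, a: 0.2 if a == 0 else 0.5)
```

Here L* = 0.2 · 2000 = 400, and the learner's cumulative loss concentrates near that value once the weights lock onto arm 0.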
Similar Papers
Open Problem: First-Order Regret Bounds for Contextual Bandits
We describe two open problems related to first-order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of Õ(√(L* K ln N)), where there are K actions, N policies, and L* is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret Õ(L*^{2/3} poly(K, ln(N/δ))). We describe some positive results, such as an ineff...
PAC-Bayesian Analysis of Contextual Bandits
We derive an instantaneous (per-round) data-dependent regret bound for stochastic multi-armed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as ...
Online Clustering of Contextual Cascading Bandits
We consider a new setting of online clustering of contextual cascading bandits, an online learning problem where the underlying cluster structure over users is unknown and needs to be learned from a random prefix feedback. More precisely, a learning agent recommends an ordered list of items to a user, who checks the list and stops at the first satisfactory item, if any. We propose an algorithm ...
The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits
We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^{2/3} S^{1/3}) or better (sometimes, much better). Here S is th...
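The Epoch-Greedy abstract above describes alternating exploration and exploitation in growing epochs without knowing the horizon. A minimal sketch of that scheme, under our own assumed interface (the `draw` simulator, policy representation, and toy loss values are illustrative, not from the paper):

```python
import random

def epoch_greedy(policies, K, draw, T, rng=None):
    """Sketch of Epoch-Greedy for contextual bandits (assumed interface).

    `policies` map a context to an arm in range(K); `draw(rng)` yields
    (context, loss_vector), but the learner only observes the played arm's
    loss. Epoch n spends one uniform-exploration round collecting an
    importance-weighted sample, then n exploitation rounds with the
    empirically best policy; the horizon T only bounds the simulation.
    """
    rng = rng or random.Random(1)
    est = [0.0] * len(policies)   # importance-weighted loss estimates
    total, t, epoch = 0.0, 0, 0
    while t < T:
        epoch += 1
        # exploration round: play a uniform arm, update every policy's estimate
        ctx, losses = draw(rng)
        a = rng.randrange(K)
        total += losses[a]
        for i, pi in enumerate(policies):
            if pi(ctx) == a:
                est[i] += K * losses[a]   # E[K * 1{pi(ctx)=a} * loss] = loss of pi
        t += 1
        # exploitation rounds: follow the empirically best policy
        best = min(range(len(policies)), key=lambda i: est[i])
        for _ in range(epoch):
            if t >= T:
                break
            ctx, losses = draw(rng)
            total += losses[policies[best](ctx)]
            t += 1
    return total

# toy problem: context in {0, 1}; playing arm == context costs 0.1, else 0.9
def draw(rng):
    ctx = rng.randrange(2)
    return ctx, [0.1 if a == ctx else 0.9 for a in (0, 1)]

total = epoch_greedy([lambda x: x, lambda x: 1 - x], K=2, draw=draw, T=1000)
```

With roughly √T exploration rounds, the cumulative loss stays close to the best policy's 0.1 per round, matching the sublinear (O(T^{2/3})-type) regret described above.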
Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits
We study contextual bandits with budget and time constraints under discrete contexts, referred to as constrained contextual bandits. The budget and time constraints significantly increase the complexity of exploration-exploitation tradeoff because they introduce coupling among contexts. Such coupling effects make it difficult to obtain oracle solutions that assume known statistics of bandits. T...
Journal: CoRR
Volume: abs/1802.03386
Publication year: 2018